4 research outputs found

    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into, at least, the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this commonly found kind of input is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables. Comment: Preprint, 9 pages

    Machine Learning with JSONiq


    RumbleML: program the lakehouse with JSONiq

    Lakehouse systems have in the past few years reached unprecedented size and heterogeneity and have been embraced by many industry players. However, they are often difficult to use as they lack the declarative language and optimization possibilities of relational engines. This paper introduces RumbleML, a high-level, declarative library integrated into the RumbleDB engine and with the JSONiq language. RumbleML allows using a single platform for data cleaning, data preparation, training, and inference, as well as management of models and results. It does so using a purely declarative language (JSONiq) for all these tasks and without any performance loss over existing platforms (e.g. Spark). The key insights of the design of RumbleML are that training sets, evaluation sets, and test sets can be represented as homogeneous sequences of flat objects; that models can be seamlessly embodied in function items mapping input test sets into prediction-augmented result sets; and that estimators can be seamlessly embodied in function items mapping input training sets to models. We argue that this makes JSONiq a viable and seamless programming language for data lakehouses across all their features, whether database-related or machine-learning-related. While lakehouses bring Machine Learning and Data Wrangling onto the same platform, RumbleML also brings them to the same language, JSONiq. In the paper, we present the first prototype and compare its performance to Spark, showing the benefit of a huge functionality and productivity gain for cleaning up, normalizing, validating data, feeding it into Machine Learning pipelines, and analyzing the output, all within the same system and language and at scale.
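
The design idea stated in this abstract (estimators as functions from training sets to models, and models as functions from test sets to prediction-augmented result sets) can be illustrated with a minimal sketch. It is written in plain Python rather than JSONiq and does not use or reproduce the RumbleML API; the names `Row`, `Model`, `Estimator`, and `majority_label_estimator`, as well as the `label` and `prediction` fields, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical types: a data set is a homogeneous sequence of flat objects.
Row = Dict[str, object]
Model = Callable[[List[Row]], List[Row]]   # test set -> augmented result set
Estimator = Callable[[List[Row]], Model]   # training set -> model


def majority_label_estimator(training_set: List[Row]) -> Model:
    """Toy estimator (an assumption, not from the paper): learns the most frequent label."""
    counts = Counter(row["label"] for row in training_set)
    majority = counts.most_common(1)[0][0]

    def model(test_set: List[Row]) -> List[Row]:
        # Return copies of the input objects, each augmented with a prediction field.
        return [{**row, "prediction": majority} for row in test_set]

    return model


training = [
    {"x": 1, "label": "a"},
    {"x": 2, "label": "a"},
    {"x": 3, "label": "b"},
]
model = majority_label_estimator(training)   # estimator applied to a training set
result = model([{"x": 10}, {"x": 11}])       # model applied to a test set
```

Because both the estimator and the model are ordinary function values here, they can be passed around, stored, and composed like any other item, which is the property the abstract attributes to function items.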

    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, a query execution engine for large, heterogeneous, and nested collections of JSON objects built on top of Apache Spark. While data sets of this type are increasingly widespread, most existing tools are built around a tabular data model, creating an impedance mismatch for both the engine and the query interface. In contrast, Rumble uses JSONiq, a standardized language specifically designed for querying JSON documents. The key challenge in the design and implementation of Rumble is mapping the recursive structure of JSON documents and JSONiq queries onto Spark's execution primitives based on tabular data frames. Our solution is to translate a JSONiq expression into a tree of iterators that dynamically switch between local and distributed execution modes depending on the nesting level. By overcoming the impedance mismatch in the engine, Rumble frees the user from solving the same problem for every single query, thus increasing their productivity considerably. As we show in extensive experiments, Rumble is able to scale to large and complex data sets in the terabyte range with a similar or better performance than other engines. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables. ISSN:2150-809